Towards  a  Unified  Approach  to  Memory-  and  Statistical-Based 

Machine  Translation 


Daniel  Marcu 

Information  Sciences  Institute  and 
Department  of  Computer  Science 
University  of  Southern  California 
4676  Admiralty  Way,  Suite  1001 
Marina  del  Rey,  CA  90292 
marcu@isi.edu


Abstract 

We present a set of algorithms that enable us to translate natural
language sentences by exploiting both a translation memory and a
statistical-based translation model. Our results show that an
automatically derived translation memory can be used within a
statistical framework to often find translations of higher probability
than those found using solely a statistical model. The translations
produced using both the translation memory and the statistical model
are significantly better than translations produced by two commercial
systems: our hybrid system translated perfectly 58% of the 505
sentences in a test collection, while the commercial systems
translated perfectly only 40-42% of them.

1  Introduction 

Over the last decade, much progress has been made in the fields of
example-based (EBMT) and statistical machine translation (SMT). EBMT
systems work by modifying existing, human-produced translation
instances, which are stored in a translation memory (TMEM). Many
methods have been proposed for storing translation pairs in a TMEM,
finding translation examples that are relevant for translating unseen
sentences, and modifying and integrating translation fragments to
produce correct outputs. Sato (1992), for example, stores complete
parse trees in the TMEM and selects and generates new translations by
performing similarity matchings on these trees. Veale and Way (1997)
store complete sentences; new translations are generated by modifying
the TMEM translation that is most similar to the input sentence.
Others store phrases; new translations are produced by optimally
partitioning the input into phrases that match examples from the TMEM
(Maruyama and Watanabe, 1992), or by finding all partial matches and
then choosing the best possible translation using a multi-engine
translation system (Brown, 1999).

With a few exceptions (Wu and Wong, 1998), most SMT systems are
couched in the noisy channel framework (see Figure 1). In this
framework, the source language, let’s say English, is assumed to be
generated by a noisy probabilistic source.¹ Most of the current
statistical MT systems treat this source as a sequence of words
(Brown et al., 1993). (Alternative approaches exist, in which the
source is taken to be, for example, a sequence of aligned
templates/phrases (Wang, 1998; Och et al., 1999) or a syntactic tree
(Yamada and Knight, 2001).) In the noisy-channel framework, a
monolingual corpus is used to derive a statistical language model that
assigns a probability to a sequence of words or phrases, thus enabling
one to distinguish between sequences of words that are grammatically
correct and sequences that are not. A sentence-aligned parallel corpus
is then used in order to build a probabilistic translation model

¹ For the rest of this paper, we use the terms source and target
languages according to the jargon specific to the noisy-channel
framework. In this framework, the source language is the language into
which the machine translation system translates.
$$\arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)$$

Figure 1: The noisy channel model.


that explains how the source can be turned into the target and that
assigns a probability to every way in which a source e can be mapped
into a target f. Once the parameters of the language and translation
models are estimated using traditional maximum likelihood and EM
techniques (Dempster et al., 1977), one can take as input any string
in the target language f, and find the source e of highest probability
that could have generated the target, a process called decoding (see
Figure 1).
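As a rough illustration of the decoding criterion in Figure 1, the following Python sketch picks, from an explicit list of candidate English strings, the one that maximizes P(f | e) P(e). The candidate list, the function names, and all probability values are made up for illustration; a real decoder searches a vastly larger space and uses trained language and translation models.

```python
import math

def noisy_channel_decode(f, candidates, lm_logprob, tm_logprob):
    """Return the candidate e that maximizes log P(f | e) + log P(e)."""
    return max(candidates, key=lambda e: tm_logprob(f, e) + lm_logprob(e))

# Toy, invented probabilities (assumptions, not values from any trained model).
LM = {"he is dead": 0.02, "it is dead": 0.005, "he dead is": 0.0001}
TM = {
    ("il est mort", "he is dead"): 0.3,
    ("il est mort", "it is dead"): 0.4,
    ("il est mort", "he dead is"): 0.3,
}

def lm_logprob(e):
    # Language model score log P(e).
    return math.log(LM[e])

def tm_logprob(f, e):
    # Translation model score log P(f | e).
    return math.log(TM[(f, e)])

best = noisy_channel_decode("il est mort", list(LM), lm_logprob, tm_logprob)
print(best)
```

In this toy setting the language model outweighs the slightly higher translation score of "it is dead", which is the kind of interaction between the two models that the noisy-channel formulation is meant to capture.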

It is clear that EBMT and SMT systems have different strengths and
weaknesses. If a sentence to be translated or a very similar one can
be found in the TMEM, an EBMT system has a good chance of producing a
good translation. However, if the sentence to be translated has no
close matches in the TMEM, then an EBMT system is less likely to
succeed. In contrast, an SMT system may be able to produce perfect
translations even when the sentence given as input does not resemble
any sentence from the training corpus. However, such a system may be
unable to generate translations that use idioms and phrases that
reflect long-distance dependencies and contexts, which are usually not
captured by current translation models.

This paper advances the state-of-the-art in two respects. First, we
show how one can use an existing statistical translation model (Brown
et al., 1993) in order to automatically derive a statistical TMEM.
Second, we adapt a decoding algorithm so that it can exploit
information specific both to the statistical TMEM and the translation
model. Our experiments show that the automatically derived translation
memory can be used within the statistical framework to often find
translations of higher probability than those found using solely the
statistical model. The translations produced using both the
translation memory and the statistical model are significantly better
than translations produced by two commercial systems.

2  The  IBM  Model  4 


For the work described in this paper we used a modified version of the
statistical machine translation tool developed in the context of the
1999 Johns Hopkins Summer Workshop (Al-Onaizan et al., 1999), which
implements IBM translation model 4 (Brown et al., 1993).

IBM model 4 revolves around the notion of word alignment over a pair
of sentences (see Figure 2). The word alignment is a graphical
representation of a hypothetical stochastic process by which a source
string e is converted into a target string f. The probability of a
given alignment a and target sentence f given a source sentence e is
given by
where the factors delineated by × symbols correspond to hypothetical
steps in the following generative process:

•  Each English word $e_i$ is assigned with probability
$n(\phi_i \mid e_i)$ a fertility $\phi_i$, which corresponds to the
number of French words into which $e_i$ is going to be translated.

•  Each English word $e_i$ is then translated with probability
$t(\tau_{ik} \mid e_i)$ into a French word $\tau_{ik}$, where $k$
ranges from 1 to the number of words $\phi_i$ (fertility of $e_i$)
into which $e_i$ is translated. For example, the English word “no” in
Figure 2 is a word of fertility 2 that is translated into “aucun” and
“ne”.

•  The rest of the factors denote distortion probabilities ($d$),
which capture the probability that words change their position when
translated from one language into another; the probability of some
French words being generated from an invisible English NULL element
($p_1$), etc. See (Brown et al., 1993) or (Germann et al., 2001) for a
detailed discussion of this translation model and a description of its
parameters.
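For reference, the Model 4 probability is usually written as a product of fertility, translation, distortion, and NULL-generation factors, in line with the generative steps listed above. The LaTeX below is a hedged restatement following the presentation in Germann et al. (2001), offered as a reading aid only. The symbols $\pi_{ik}$ (the position of $\tau_{ik}$), $c_{\rho_i}$ (the center of the translation of the previous fertile English word), $\phi_0$ (the number of NULL-generated French words), and $l$ and $m$ (the English and French sentence lengths) are assumptions drawn from that presentation and should be checked against Brown et al. (1993).

```latex
\begin{align*}
P(a, f \mid e) ={}& \prod_{i=1}^{l} n(\phi_i \mid e_i)
  \times \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} t(\tau_{ik} \mid e_i) \\
  &\times \prod_{i=1,\ \phi_i > 0}^{l}
     d_1\bigl(\pi_{i1} - c_{\rho_i} \mid \mathrm{class}(e_{\rho_i}), \mathrm{class}(\tau_{i1})\bigr) \\
  &\times \prod_{i=1}^{l} \prod_{k=2}^{\phi_i}
     d_{>1}\bigl(\pi_{ik} - \pi_{i(k-1)} \mid \mathrm{class}(\tau_{ik})\bigr) \\
  &\times \binom{m - \phi_0}{\phi_0}\, p_1^{\phi_0} (1 - p_1)^{m - 2\phi_0}
   \times \prod_{k=1}^{\phi_0} t(\tau_{0k} \mid \mathrm{NULL})
\end{align*}
```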

3  Building  a  statistical  translation 
memory 

Companies that specialize in producing high-quality human translations
of documentation and news often rely on translation memory tools to
increase their productivity (Sprung, 2000). Building a high-quality
TMEM is an expensive process that requires many person-years of work.
Since we are not in the fortunate position of having access to an
existing TMEM, we decided to build one automatically.

We trained IBM translation model 4 on 500,000 English-French sentence
pairs from the Hansard corpus. We then used the Viterbi alignment of
each sentence, i.e., the alignment of highest probability, to extract
tuples of the form
$(e_i, e_{i+1}, \ldots, e_{i+k};\ f_j, f_{j+1}, \ldots, f_{j+l};\ a_j, a_{j+1}, \ldots, a_{j+l})$,
where $e_i, e_{i+1}, \ldots, e_{i+k}$ represents a contiguous English
phrase, $f_j, f_{j+1}, \ldots, f_{j+l}$ represents a contiguous French
phrase, and $a_j, a_{j+1}, \ldots, a_{j+l}$ represents the Viterbi
alignment between the two phrases. We selected only “contiguous”
alignments, i.e., alignments in which the words in the English phrase
generated only words in the French phrase and each word in the French
phrase was generated either by the NULL word or a word from the
English phrase. We extracted only tuples in which the English and
French phrases contained at least two words.

For example, in the Viterbi alignment of the two sentences in Figure
2, which was produced automatically, “there” and “.” are words of
fertility 0, NULL generates the French lexeme “.”, “is” generates
“est”, “no” generates “aucun” and “ne”, and so on. From this alignment
we extracted the


NULL  there  is  no  one  union  involved  .


aucun  syndicat  particulier  ne  est  en  cause  .


Figure  2:  Example  of  Viterbi  alignment  produced 
by  IBM  model  4. 

six tuples shown in Table 1, because they were the only ones that
satisfied all conditions mentioned above. For example, the pair
(no one; aucun syndicat particulier ne) does not occur in the
translation memory because the French word “syndicat” is generated by
the word “union”, which does not occur in the English phrase “no one”.
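The extraction conditions described above can be sketched in Python as follows. This is only a literal reading of the stated conditions (contiguity on both sides, French words generated by NULL or by the English phrase, at least two words per phrase); the function name and the alignment encoding are hypothetical, and the sketch may admit pairs beyond those listed in Table 1, for instance pairs that differ only by boundary words of fertility 0 or NULL-generated words.

```python
def extract_phrase_pairs(english, french, alignment, min_len=2):
    """Enumerate (English phrase, French phrase) pairs meeting the stated conditions.

    alignment[j] is the index in `english` of the word that generated
    french[j], or None when french[j] was generated by NULL.
    """
    pairs = []
    for i in range(len(english)):
        for i2 in range(i + min_len - 1, len(english)):
            # French positions generated by the English span [i, i2].
            generated = [j for j, src in enumerate(alignment)
                         if src is not None and i <= src <= i2]
            for j in range(len(french)):
                for j2 in range(j + min_len - 1, len(french)):
                    # English words may generate only words inside the French span.
                    if any(g < j or g > j2 for g in generated):
                        continue
                    # French words must come from NULL or from the English span.
                    if any(alignment[x] is not None and not i <= alignment[x] <= i2
                           for x in range(j, j2 + 1)):
                        continue
                    pairs.append((" ".join(english[i:i2 + 1]),
                                  " ".join(french[j:j2 + 1])))
    return pairs

english = "there is no one union involved .".split()
french = "aucun syndicat particulier ne est en cause .".split()
# Alignment read off the description of Figure 2; None marks NULL.
alignment = [2, 4, 3, 2, 1, 5, 5, None]
pairs = set(extract_phrase_pairs(english, french, alignment))
print(("one union", "syndicat particulier") in pairs)
print(("no one", "aucun syndicat particulier ne") in pairs)
```

The second check reproduces the rejection discussed in the text: “syndicat” is generated by “union”, which lies outside the English phrase “no one”.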

By extracting all tuples of the form (e; f; a) from the training
corpus, we ended up with many duplicates and with French phrases that
were paired with multiple English translations. We chose for each
French phrase only one possible English translation equivalent. We
tried out two distinct methods for choosing a translation equivalent,
thus constructing two different probabilistic TMEMs:

•  The Frequency-based Translation MEMory (FTMEM) was created by
associating with each French phrase the English equivalent that
occurred most often in the collection of phrases that we extracted.

•  The Probability-based Translation MEMory (PTMEM) was created by
associating with each French phrase the English equivalent that
corresponded to the alignment of highest probability.
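The two selection rules can be contrasted with a small Python sketch. It assumes, hypothetically, that each extracted tuple has been reduced to a triple (English phrase, French phrase, alignment probability); the function names and the toy probabilities are illustrative and not taken from the original software.

```python
from collections import Counter, defaultdict

def build_ftmem(tuples):
    """Keep, for each French phrase, the English phrase seen most often."""
    counts = defaultdict(Counter)
    for english, french, _prob in tuples:
        counts[french][english] += 1
    return {french: c.most_common(1)[0][0] for french, c in counts.items()}

def build_ptmem(tuples):
    """Keep, for each French phrase, the English phrase whose alignment
    has the highest probability."""
    best = {}
    for english, french, prob in tuples:
        if french not in best or prob > best[french][1]:
            best[french] = (english, prob)
    return {french: english for french, (english, _prob) in best.items()}

# Toy extracted tuples with invented alignment probabilities.
extracted = [
    ("he is talking", "il parle", 0.01),
    ("he is talking", "il parle", 0.01),
    ("he speaks", "il parle", 0.05),
]
ftmem = build_ftmem(extracted)
ptmem = build_ptmem(extracted)
print(ftmem["il parle"], "|", ptmem["il parle"])
```

The toy data is chosen so that the two rules disagree: the more frequent equivalent wins in the FTMEM, while the single higher-probability alignment wins in the PTMEM.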

In contrast to other TMEMs, our TMEMs explicitly encode not only the
mutual translation pairs but also their corresponding word-level
alignments, which are derived according to a certain translation model
(in our case, IBM model 4). The mutual translations can be anywhere
from two words long to complete sentences. Both methods yielded
translation memories that contained around 11.8 million word-aligned
translation pairs. Due to efficiency considerations and memory
limitations (the software we wrote loads a complete TMEM into memory),
we used in our experiments only a fraction of the TMEMs, those that
contained phrases at most 10


| English | French | Alignment |
|---|---|---|
| one union | syndicat particulier | one → {particulier}; union → {syndicat} |
| no one union | aucun syndicat particulier ne | no → {aucun, ne}; one → {particulier}; union → {syndicat} |
| is no one union | aucun syndicat particulier ne est | is → {est}; no → {aucun, ne}; one → {particulier}; union → {syndicat} |
| there is no one union | aucun syndicat particulier ne est | is → {est}; no → {aucun, ne}; one → {particulier}; union → {syndicat} |
| is no one union involved | aucun syndicat particulier ne est en cause | is → {est}; no → {aucun, ne}; one → {particulier}; union → {syndicat}; involved → {en cause} |
| there is no one union involved | aucun syndicat particulier ne est en cause | |
| there is no one union involved . | aucun syndicat particulier ne est en cause . | is → {est}; no → {aucun, ne}; one → {particulier}; union → {syndicat}; involved → {en cause}; NULL → { . } |

Table 1: Examples of automatically constructed statistical translation memory entries.


| TMEM | Perfect | Almost perfect | Incorrect | Unable to judge |
|---|---|---|---|---|
| FTMEM | 62.5% | 8.5% | 27.0% | 2.0% |
| PTMEM | 57.5% | 7.5% | 33.5% | 1.5% |

Table 2: Accuracy of automatically constructed TMEMs.


words  long.  This  yielded  a  working  FTMEM  of 
4.1  million  and  a  PTMEM  of  5.7  million  phrase 
translation  pairs  aligned  at  the  word  level  using 
IBM  statistical  model  4. 

To evaluate the quality of both TMEMs we built, we randomly extracted
200 phrase pairs from each TMEM. These phrases were judged by a
bilingual speaker as

•  perfect translations if she could imagine contexts in which the
aligned phrases could be mutual translations of each other;

•  almost perfect translations if the aligned phrases were mutual
translations of each other and one phrase contained one single word
with no equivalent in the other language²;

•  incorrect translations if the judge could not imagine any contexts
in which the aligned phrases could be mutual translations of each
other.


² For example, the translation pair “final , le secretaire de” and
“final act , the secretary of” was labeled as almost perfect because
the English word “act” has no French equivalent.


The results of the evaluation are shown in Table 2. A visual
inspection of the phrases in our TMEMs and the judgments made by the
evaluator suggest that many of the translations labeled as incorrect
make sense when assessed in a larger context. For example, “autres
regions de le pays que” and “other parts of Canada than” were judged
as incorrect. However, when considered in a context in which it is
clear that “Canada” and “pays” corefer, it would be reasonable to
assume that the translation is correct. Table 3 shows a few examples
of phrases from our FTMEM and their corresponding correctness
judgments.

Although we found our evaluation to be extremely conservative, we
decided nevertheless to stick to it as it adequately reflects
constraints specific to high-standard translation environments in
which TMEMs are built manually and constantly checked for quality by
specialized teams (Sprung, 2000).

4  Statistical  decoding  using  both  a 
statistical  TMEM  and  a  statistical 
translation  model 

The results in Table 2 show that about 70% of the entries in our
translation memory are correct or almost correct (very easy to fix).
It is, though, an empirical question to what extent such TMEMs can be
used to improve the performance of current translation systems. To
determine this, we modified an existing decoding algorithm so that it
can exploit information specific both to a statistical translation
model and a statistical TMEM.


| English | French | Judgment |
|---|---|---|
| , but I cannot say | , mais je ne puis dire | correct |
| how did this all come about ? | comment est-ce arrivee ? | correct |
| but , I humbly believe | mais , a mon humble avis | correct |
| final act , the secretary of | final , le secretaire de | almost correct |
| other parts of Canada than | autres regions de le pays que | incorrect |
| what is the total amount accumulated | a combien se eleve la | incorrect |
| that party present this | ce parti present aujourd’hui | incorrect |
| the aircraft company to present further studies | de autre etudes | incorrect |

Table 3: Examples of TMEM entries with correctness judgments.


The decoding algorithm that we use is a greedy one; see Germann et al.
(2001) for details. The decoder first guesses an English translation
for the French sentence given as input and then attempts to improve it
by exploring greedily alternative translations from the immediate
translation space. We modified the greedy decoder described by Germann
et al. (2001) so that it attempts to find a good translation starting
from two distinct points in the space of possible translations: one
point corresponds to a word-for-word “gloss” of the French input; the
other point corresponds to a translation that resembles most closely
translations stored in the TMEM.

As discussed by Germann et al. (2001), the word-for-word gloss is
constructed by aligning each French word $f_j$ with its most likely
English translation $e_{f_j}$
($e_{f_j} = \arg\max_{e} t(e \mid f_j)$). For example, in translating
the French sentence “Bien entendu , il parle de une belle victoire .”,
the greedy decoder initially assumes that a good translation of it is
“Well heard , it talking a beautiful victory” because the best
translation of “bien” is “well”, the best translation of “entendu” is
“heard”, and so on. A word-for-word gloss results (at best) in English
words written in French word order.
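A minimal Python sketch of the gloss construction follows. The table `t_table`, which maps each French word to a dictionary of candidate English words with their scores t(e | f), is a hypothetical data structure, and its values are invented for illustration.

```python
def gloss(french_words, t_table):
    """Replace each French word by its highest-scoring English word."""
    return [max(t_table[f], key=t_table[f].get) for f in french_words]

# Toy translation table with made-up scores t(e | f).
t_table = {
    "bien": {"well": 0.6, "good": 0.3},
    "entendu": {"heard": 0.5, "understood": 0.4},
    ",": {",": 0.9},
}

print(" ".join(gloss(["bien", "entendu", ","], t_table)))
```

Because each word is chosen independently, the output keeps the French word order, which is why the gloss is only a starting point for the decoder.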

The translation that resembles most closely translations stored in the
TMEM is constructed by deriving a “cover” for the input sentence using
phrases from the TMEM. The derivation attempts to cover with
translation pairs from the TMEM as much of the input sentence as
possible, using the longest phrases in the TMEM. The words in the
input that are not part of any phrase extracted from the TMEM are
glossed. For example, this approach may start the translation process
from the phrase “well , he is talking a beautiful victory” if the TMEM
contains the pairs (well , ; bien entendu ,) and (he is talking; il
parle) but no pair with the French phrase “belle victoire”.
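One simple way to derive such a cover is a left-to-right, longest-match-first scan, sketched below in Python. The scanning strategy is an assumption made for illustration (the text does not specify how the cover is searched for), and `tmem_cover`, `gloss_word`, and the toy dictionaries are hypothetical names and data.

```python
def tmem_cover(french_words, tmem, gloss_word, max_len=10):
    """Cover the input with TMEM phrases, longest match first; gloss the rest."""
    output, pos = [], 0
    n = len(french_words)
    while pos < n:
        # Try the longest phrase first; TMEM phrases have at least two words.
        for length in range(min(max_len, n - pos), 1, -1):
            phrase = " ".join(french_words[pos:pos + length])
            if phrase in tmem:
                output.extend(tmem[phrase].split())
                pos += length
                break
        else:
            # No TMEM phrase starts here: fall back to the word gloss.
            word = gloss_word(french_words[pos])
            if word is not None:  # None stands for a word glossed as NULL
                output.append(word)
            pos += 1
    return output

tmem = {"bien entendu ,": "well ,", "il parle": "he is talking"}
glosses = {"de": None, "une": "a", "belle": "beautiful",
           "victoire": "victory", ".": "."}
sentence = "bien entendu , il parle de une belle victoire .".split()
cover = tmem_cover(sentence, tmem, glosses.get)
print(" ".join(cover))
```

With these toy entries the sketch yields the seed discussed in the text, with the uncovered words “de une belle victoire .” handled by the gloss.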

If the input sentence is found “as is” in the translation memory, its
translation is simply returned and there is no further processing.
Otherwise, once an initial alignment is created, the greedy decoder
tries to improve it, i.e., it tries to find an alignment (and
implicitly a translation) of higher probability by modifying locally
the initial alignment. The decoder attempts to find alignments and
translations of higher probability by employing a set of simple
operations, such as changing the translation of one or two words in
the alignment under consideration, inserting into or deleting from the
alignment words of fertility zero, and swapping words or segments.

In a stepwise fashion, starting from the initial gloss or initial
cover, the greedy decoder iterates exhaustively over all alignments
that are one such simple operation away from the alignment under
consideration. At every step, the decoder chooses the alignment of
highest probability, until the probability of the current alignment
can no longer be improved.
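The search loop described above is a form of hill climbing, and its control flow can be sketched generically in Python. The sketch deliberately abstracts away the alignment operations: `neighbors` and `score` are hypothetical callables standing in for "all alignments one operation away" and "alignment probability", and the toy landscape over integers is invented purely to show why starting from two seeds can help.

```python
def greedy_decode(seed, neighbors, score):
    """Hill-climb from `seed`, moving to the best-scoring neighbor each step."""
    current, current_score = seed, score(seed)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:  # no neighbor improves the score: local optimum
            return current, current_score
        current, current_score = best, best_score

def decode_from_seeds(seeds, neighbors, score):
    """Run the greedy search from every seed and keep the best result."""
    return max((greedy_decode(s, neighbors, score) for s in seeds),
               key=lambda result: result[1])

# Toy landscape with two local optima (positions 2 and 8).
LANDSCAPE = [1, 2, 3, 2, 1, 0, 4, 6, 9, 6, 4]

def score(x):
    return LANDSCAPE[x]

def neighbors(x):
    return [y for y in (x - 1, x + 1) if 0 <= y < len(LANDSCAPE)]

from_gloss = greedy_decode(0, neighbors, score)
best = decode_from_seeds([0, 10], neighbors, score)
print(from_gloss, best)
```

Here the search from position 0 stalls at a lower peak, while the second seed reaches a higher one; this mirrors the role the TMEM cover plays alongside the gloss.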

5  Evaluation 

We extracted from the test corpus a collection of 505 French
sentences, uniformly distributed across the lengths 6, 7, 8, 9, and
10. For each French sentence, we had access to the human-generated
English translation in the test corpus, and to translations generated
by two commercial systems. We produced translations using three
versions of the greedy decoder: one used only the statistical
translation model, one used the translation model and the FTMEM, and
one used the translation model and the PTMEM.

We initially assessed how often the translations obtained from TMEM
seeds had higher probability than the translations obtained from
simple glosses. Tables 4 and 5 show that the translation memories
significantly help the decoder find translations of high probability.
In about 30% of the cases, the translations are simply copied from a
TMEM, and in about 13% of the cases the translations obtained from a
TMEM seed have higher probability than the best translations obtained
from a simple gloss. In 40% of the cases both seeds (the TMEM and the
gloss) yield the same translation. In only about 15-18% of the cases
are the translations obtained from the gloss better than the
translations obtained from the TMEM seeds. It appears that both TMEMs
help the decoder find translations of higher probability consistently,
across all sentence lengths.

| Sent. length | Found in FTMEM | Higher prob. from FTMEM | Same result | Higher prob. from gloss |
|---|---|---|---|---|
| 6 | 33 | 9 | 43 | 16 |
| 7 | 27 | 9 | 48 | 17 |
| 8 | 29 | 16 | 42 | 14 |
| 9 | 31 | 15 | 28 | 27 |
| 10 | 31 | 9 | 43 | 18 |
| All (%) | 30% | 12% | 40% | 18% |

Table 4: The utility of the FTMEM.

| Sent. length | Found in PTMEM | Higher prob. from PTMEM | Same result | Higher prob. from gloss |
|---|---|---|---|---|
| 6 | 33 | 9 | 43 | 16 |
| 7 | 27 | 10 | 50 | 14 |
| 8 | 30 | 16 | 41 | 14 |
| 9 | 31 | 15 | 36 | 19 |
| 10 | 31 | 15 | 31 | 13 |
| All (%) | 31% | 13% | 41% | 15% |

Table 5: The utility of the PTMEM.

In a second experiment, a bilingual judge scored the human
translations extracted from the automatically aligned test corpus; the
translations produced by a greedy decoder that uses both TMEM and
gloss seeds; the translations produced by a greedy decoder that uses
only the statistical model and the gloss seed; and translations
produced by two commercial systems (A and B).

•  If an English translation had the very same meaning as the French
original, it was considered semantically correct. If the meaning was
just a little different, the translation was considered semantically
incorrect. For example, “this is rather provision disturbing” was
judged as a semantically correct translation of “voila une disposition
plutot inquietante”, but “this disposal is rather disturbing” was
judged as incorrect.

•  If a translation was perfect from a grammatical perspective, it was
considered to be grammatical. Otherwise, it was considered
ungrammatical. For example, “this is rather provision disturbing” was
judged as ungrammatical, although one may very easily make sense of
it.

We decided to use such harsh evaluation criteria because, in previous
experiments, we repeatedly found that harsh criteria can be applied
consistently. To ensure consistency during evaluation, the judge used
a specialized interface: once the correctness of a translation
produced by a system S was judged, the same judgment was automatically
recorded with respect to the other systems as well. This way, it
became impossible for a translation to be judged as correct when
produced by one system and incorrect when produced by another system.
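The consistency mechanism amounts to recording each judgment once per distinct translation and reusing it. The Python sketch below illustrates that idea under the assumption that judgments are keyed by the pair (French sentence, English translation); the class name, the `ask_judge` callback, and the judgment format are all hypothetical and do not describe the interface actually used.

```python
class JudgmentStore:
    """Record one judgment per (French sentence, English translation) pair."""

    def __init__(self):
        self._judgments = {}

    def judge(self, french, english, ask_judge):
        key = (french, english)
        if key not in self._judgments:
            # Only the first occurrence is shown to the human judge.
            self._judgments[key] = ask_judge(french, english)
        return self._judgments[key]

calls = []

def ask_judge(french, english):
    # Stand-in for the human judge; logs each time it is consulted.
    calls.append((french, english))
    return {"semantic": True, "grammatical": False}

store = JudgmentStore()
french = "voila une disposition plutot inquietante"
english = "this is rather provision disturbing"
from_system_a = store.judge(french, english, ask_judge)
from_system_b = store.judge(french, english, ask_judge)
print(len(calls), from_system_a == from_system_b)
```

When two systems produce the same output for the same input, the judge is consulted only once, so the two outputs cannot receive conflicting labels.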

Table 6, which summarizes the results, displays the percentage of
perfect translations (both semantically and grammatically) produced by
a variety of systems. Table 6 shows that translations produced using
both TMEM and gloss seeds are much better than translations that do
not use TMEMs. The translation systems that use both a TMEM and the
statistical model significantly outperform the two commercial systems.
The figures in Table 6 also reflect the harshness of our evaluation
metric: only 82% of the human translations extracted from the test
corpus were considered perfect translations. A few of the errors were
genuine, and could be explained by failures of the sentence alignment
program that was used to create the corpus (Melamed, 1999). Most of
the errors were judged as semantic, reflecting directly the harshness
of our evaluation metric.

| Sentence length | Humans | Greedy with FTMEM | Greedy with PTMEM | Greedy without TMEM | Commercial system A | Commercial system B |
|---|---|---|---|---|---|---|
| 6 | 92 | 72 | 70 | 52 | 55 | 59 |
| 7 | 73 | 58 | 52 | 37 | 42 | 43 |
| 8 | 80 | 53 | 52 | 30 | 38 | 29 |
| 9 | 84 | 53 | 53 | 37 | 40 | 35 |
| 10 | 85 | 57 | 60 | 36 | 40 | 37 |
| All (%) | 82% | 58% | 57% | 38% | 42% | 40% |

Table 6: Percent of perfect translations produced by various translation systems and algorithms.

6  Discussion

The approach to translation described in this paper is quite general.
It can be applied in conjunction with other statistical translation
models. And it can be applied in conjunction with existing translation
memories. To do this, one would simply have to train the statistical
model on the translation memory provided as input, determine the
Viterbi alignments, and enhance the existing translation memory with
word-level alignments as produced by the statistical translation
model. We suspect that using manually produced TMEMs can only increase
the performance as such TMEMs undergo periodic checks for quality
assurance.

The work that comes closest to using a statistical TMEM similar to the
one we propose here is that of Vogel and Ney (2000), who automatically
derive from a parallel corpus a hierarchical TMEM. The hierarchical
TMEM consists of a set of transducers that encode a simple grammar.
The transducers are automatically constructed: they reflect common
patterns of usage at levels of abstraction that are higher than the
words. Vogel and Ney (2000) do not evaluate their TMEM-based system,
so it is difficult to empirically compare their approach with ours.
From a theoretical perspective, it appears, though, that the two
approaches are complementary: Vogel and Ney (2000) identify abstract
patterns of usage and then use them during translation. This may
address the data sparseness problem that is characteristic of any
statistical modeling effort and produce better translation parameters.

In contrast, our approach attempts to steer the statistical decoding
process into directions that are difficult to reach when one relies
only on the parameters of a particular translation model. For example,
the two phrases “il est mort” and “he kicked the bucket” may appear in
only one sentence in an arbitrarily large corpus. The parameters
learned from the entire corpus will very likely assign a very low
probability to the words “kicked” and “bucket” being translated into
“est” and “mort”. Because of this, a statistical-based MT system will
have trouble producing a translation that uses the phrase “kick the
bucket”, no matter what decoding technique it employs. However, if the
two phrases are stored in the TMEM, producing such a translation
becomes feasible.

If optimal decoding algorithms capable of searching exhaustively the
space of all possible translations existed, using TMEMs in the style
presented in this paper would never improve the performance of a
system. Our approach works because it biases the decoder to search in
subspaces that are likely to yield translations of high probability,
subspaces which otherwise may not be explored. The bias introduced by
TMEMs is a practical alternative to finding optimal translations,
which is NP-complete (Knight, 1999).

It is clear that one of the main strengths of the TMEM is its ability
to encode contextual, long-distance dependencies that are incongruous
with the parameters learned by current context-poor, reductionist
channel models. Unfortunately, the criterion used by the decoder in
order to choose between a translation produced starting from a gloss
and one produced starting from a TMEM is biased in favor of the
gloss-based translation. It is possible for the decoder to produce a
perfect translation using phrases from the TMEM, and yet to discard
the perfect translation in favor of an incorrect translation of higher
probability that was obtained from a gloss (or from the TMEM). It
would be desirable to develop alternative ranking techniques that
would permit one to prefer in some instances a TMEM-based translation,
even though that translation is not the best according to the
probabilistic channel model. The examples in Table 7 show, though,
that this is not trivial: it is not always the case that the
translation of highest probability is the perfect one. The first
French sentence in Table 7 is correctly translated with or without
help from the translation memory. The second sentence is correctly
translated only when the system uses a TMEM seed; and fortunately, the
translation of highest probability is the one obtained using the TMEM
seed. The translation obtained from the TMEM seed is also correct for
the third sentence. But unfortunately, in this case, the TMEM-based
translation is not the most probable.

| French input | Translation | Does this translation use TMEM phrases? | Is this translation correct? | Is this the translation of highest probability? |
|---|---|---|---|---|
| monsieur le president , je aimerais savoir . | mr. speaker , i would like to know . | yes | yes | yes |
| monsieur le president , je aimerais savoir . | mr. speaker , i would like to know . | no | yes | yes |
| je ne peux vous entendre , brian . | i cannot hear you , brian . | yes | yes | yes |
| je ne peux vous entendre , brian . | i can you listen , brian . | no | no | no |
| alors , je termine la - dessus . | therefore , i will conclude my remarks . | yes | yes | no |
| alors , je termine la - dessus . | therefore , i conclude - over . | no | no | yes |

Table 7: Example of system outputs, obtained with or without TMEM help.